%%HTML
<script src="require.js"></script>
from IPython.display import HTML, display, display_html
HTML("""<style>
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
}
</style>
""")
HTML('''<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js"></script><script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
} else {
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);</script><form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>
''')
#!pip install transformers
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
# Local only (change to Jojie directory if needed)
os.environ['XDG_CACHE_HOME'] = '/home/msds2023/calbao/.cache'
os.environ['HUGGINGFACE_HUB_CACHE'] = '/home/msds2023/calbao/.cache'
import torch
# Set device to GPU if available, otherwise use CPU
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print('Running on device: {}'.format(device))
Running on device: cpu
import numpy as np
import io
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
import seaborn as sns
from tqdm.notebook import tqdm
from moviepy.editor import VideoFileClip, ImageSequenceClip
import torch
from facenet_pytorch import MTCNN
from transformers import (AutoFeatureExtractor,
                          AutoModelForImageClassification,
                          AutoConfig)
from PIL import Image, ImageDraw
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import skimage.io as sio
import glob
from amber import *
In recent years, the rapid progress of artificial intelligence (AI) has significantly impacted various sectors of society. While its presence in industries like healthcare, finance, and transportation is widely recognized, the potential of AI in the arts, particularly in cinema, is yet to be fully explored.
Facial expressions, often dubbed the universal language of emotion, play a pivotal role in storytelling, especially in cinema. The subtleties of an actor's performance, expressed through their countenance, can breathe life into a character, tug at the audience's heartstrings, and etch memorable moments in the collective psyche.
Looking back at the era of silent films, one cannot help but marvel at the depth of emotion conveyed solely through facial expressions. A notable example is Charlie Chaplin's classic "The Gold Rush", a masterpiece that showcases the effectiveness of facial expressions in narrating a compelling story, demonstrating a facet of cinema that resonates profoundly even in the absence of dialogue.
The nuances of facial expressions remain crucial in contemporary cinema as well. In this study, titled "Eme or Emotional? Leveraging Facial Recognition and Explainable AI for Compelling Performances in Pinoy Cinema", we explore the role of AI in dissecting and enhancing the quality of such performances. Through a combination of facial recognition and explainable AI, we aim to provide a unique lens through which to examine and learn from the rich tapestry of emotions in Pinoy cinema.
The motivation for our study stems from the obstacles aspiring actors encounter in their pursuit of proficiency. Expert coaching, often seen as a cornerstone of skill development, is not universally accessible. Furthermore, the evaluation of performances is subjective and can vary significantly, leading to inconsistent feedback and potential confusion for actors.
Self-directed learners may opt to study great performances for inspiration, but the enormous volume of potential resources makes this task overwhelming. Therefore, the acting community needs an accessible, objective, and efficient evaluation method that can be used as a universal learning tool.
Our study seeks to address these challenges by leveraging Facial Recognition and Explainable AI. This combination aims to standardize evaluations, democratize access to quality coaching resources, and make the process of learning from extensive film performances more practical.
This research addresses the question: "How can artificial intelligence and explainability techniques be utilized to offer objective and measurable insights to budding actors, thereby enhancing the quality of their craft in the Philippine cinema industry?" The goal is to formulate a solution that can guide aspiring actors in their professional development.
For the purposes of this study, we sought to analyze a diverse range of emotions as depicted in Philippine cinema. In order to achieve this, we chose to focus on critically acclaimed films from a variety of genres, including drama, romance, and horror. These films were selected not only for their quality and recognition, but also for their display of a wide array of emotions. In particular, the research makes use of video clips extracted from four notable Filipino films:
To train the explainer model, a custom dataset was compiled by scraping Google Images; the scraping code is documented in supp-scrape-dataset.ipynb. This dataset captures the full gamut of human emotions, classified into seven key categories: angry, disgust, fear, happy, neutral, sad, and surprise. It was then split into train, validation, and test sets in supp-split-dataset.ipynb in preparation for training the explainer model. The decision to custom-build this dataset stems from the need for a robust representation of emotions, crucial for the effective functioning of our model.
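As an illustration of the kind of per-class split performed in supp-split-dataset.ipynb (the function name and ratios below are assumptions; the actual notebook is not reproduced here), splitting within each emotion category keeps the seven classes balanced across the three sets:

```python
import random

# Hypothetical sketch of a stratified train/validation/test split:
# shuffling and splitting each class separately keeps the emotion
# categories balanced across the three sets.
def stratified_split(files_by_class, ratios=(0.8, 0.1, 0.1), seed=42):
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label, files in files_by_class.items():
        files = files[:]
        rng.shuffle(files)
        n_train = int(len(files) * ratios[0])
        n_val = int(len(files) * ratios[1])
        splits["train"] += [(f, label) for f in files[:n_train]]
        splits["val"] += [(f, label) for f in files[n_train:n_train + n_val]]
        splits["test"] += [(f, label) for f in files[n_train + n_val:]]
    return splits

demo = {"happy": [f"happy_{i}.jpg" for i in range(10)],
        "sad": [f"sad_{i}.jpg" for i in range(10)]}
sizes = {k: len(v) for k, v in stratified_split(demo).items()}
print(sizes)  # {'train': 16, 'val': 2, 'test': 2}
```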
The methodology employed in this study comprises a three-stage pipeline leveraging state-of-the-art pre-trained models, each serving a specific function in our automated content analysis system.
Figure 1. Methodology
Our methodology provides a systematic approach to objectively analyze acting performances, using a combination of advanced AI techniques. This pipeline allows us to dissect acting performances with precision and offer measurable insights, thus democratizing access to high-quality acting resources and enhancing the potential of Philippine cinema.
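The three stages compose naturally into one pass over a frame. The sketch below is a hypothetical skeleton (the callables are placeholders for the MTCNN, ViT-Face-Expression, and Explainer components implemented later in this notebook), showing how a single frame flows through detection, classification, and explanation:

```python
# Hypothetical skeleton of the three-stage pipeline; the three callables
# stand in for the actual detection, classification, and explainer models.
def analyze_frame(frame, detect_face, classify_emotion, explain):
    face = detect_face(frame)          # stage 1: face detection
    if face is None:
        return None                    # no face found in this frame
    probs = classify_emotion(face)     # stage 2: emotion probabilities
    mask = explain(face, probs)        # stage 3: attribution mask
    return {"probs": probs, "mask": mask}

# Dummy stand-ins to show the control flow
result = analyze_frame(
    "frame",
    detect_face=lambda f: "face",
    classify_emotion=lambda f: {"happy": 1.0},
    explain=lambda f, p: "mask",
)
print(result)  # {'probs': {'happy': 1.0}, 'mask': 'mask'}
```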
The first stage of our pipeline is face detection in film clips, for which we leverage Multi-Task Cascaded Convolutional Networks (MTCNN) [1]. MTCNN is an efficient deep learning model for face detection and alignment that uses a three-stage cascade to generate and progressively refine candidate facial windows, and subsequently locate facial landmarks accurately. It was introduced by Zhang et al. in IEEE Signal Processing Letters in 2016 [2].
Figure 2. Multi-Task Cascaded Convolutional Networks (MTCNN) Architecture
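The cascading behavior can be illustrated with a toy example. This is not the library's internals, only a sketch of the idea: each stage discards candidate windows whose confidence falls below that stage's threshold (in the real network, each stage also re-scores and refines the surviving windows). The thresholds mirror those passed to the MTCNN constructor below:

```python
# Toy sketch of MTCNN's cascade (not the library's internals): each stage
# rejects candidate windows below its confidence threshold.
stage_thresholds = [0.6, 0.7, 0.7]

def cascade_filter(candidates, thresholds):
    """Keep only candidates that survive every stage's threshold."""
    surviving = candidates
    for t in thresholds:
        surviving = [c for c in surviving if c["score"] >= t]
    return surviving

candidates = [
    {"box": (10, 10, 50, 50), "score": 0.95},  # survives all stages
    {"box": (60, 20, 40, 40), "score": 0.65},  # rejected at stage 2
    {"box": (5, 80, 30, 30), "score": 0.40},   # rejected at stage 1
]
kept = cascade_filter(candidates, stage_thresholds)
print(len(kept))  # 1
```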
# Initialize MTCNN model for single face cropping
mtcnn = MTCNN(
image_size=160,
margin=0,
min_face_size=200,
thresholds=[0.6, 0.7, 0.7],
factor=0.709,
post_process=True,
keep_all=False,
device=device
)
# Load the pre-trained model and feature extractor
extractor = AutoFeatureExtractor.from_pretrained(
"trpakov/vit-face-expression"
)
model = AutoModelForImageClassification.from_pretrained(
"trpakov/vit-face-expression"
)
2023-06-14 16:34:40.284248: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
The second stage is emotion recognition, for which we employ the ViT-Face-Expression model. The Vision Transformer (ViT), a game-changer in visual data analysis, adapts the Transformer architecture from natural language processing to images, treating each image as a sequence of patches. By leveraging self-attention mechanisms, ViT captures both local and global patterns, allowing it to match or even exceed traditional Convolutional Neural Network (CNN) models on certain computer vision tasks. Dosovitskiy et al. introduced ViT at the 2021 International Conference on Learning Representations [3].
Figure 3. Vision Transformer (ViT) Architecture
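ViT's treatment of an image as a sequence of patches can be made concrete with a few lines of NumPy. Assuming the standard ViT-Base configuration of a 224x224 input and 16x16 patches (an assumption for illustration; the actual preprocessing is handled by the feature extractor), the image becomes 196 flattened patch vectors of length 768:

```python
import numpy as np

# Sketch of ViT's patch-sequence input (assumed 224x224 image, 16x16
# patches, as in ViT-Base). Each patch is flattened into one "token".
image = np.random.rand(224, 224, 3)
patch = 16
n = 224 // patch  # 14 patches per side

patches = (
    image.reshape(n, patch, n, patch, 3)
         .transpose(0, 2, 1, 3, 4)       # group the two patch-grid axes
         .reshape(n * n, patch * patch * 3)
)
print(patches.shape)  # (196, 768): 196 tokens of dimension 768
```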
def detect_emotions_single(image, vid_img, clr):
    """
    Detect the emotion in a single image and display the video frame
    together with a radar plot of the predicted emotion probabilities.

    Parameters
    ----------
    image : PIL.Image
        The input image to run face detection on.
    vid_img : np.ndarray
        The original video frame shown above the radar plot.
    clr : str
        Hex color of the radar plot trace.
    """
# Create a copy of the image to draw on
temporary = image.copy()
# Use the MTCNN model to detect faces in the image
sample = mtcnn.detect(temporary)
# If a face is detected
if sample[0] is not None:
# Get the bounding box coordinates of the face
box = sample[0][0]
# Crop the detected face from the image
face = temporary.crop(box)
# Pre-process the cropped face to be fed into the emotion detection model
inputs = extractor(images=face, return_tensors="pt")
# Pass the pre-processed face through the model to get emotion predictions
outputs = model(**inputs)
# Apply softmax to the logits to get probabilities
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
# Retrieve the id2label attribute from the configuration
id2label = AutoConfig.from_pretrained(
"trpakov/vit-face-expression"
).id2label
# Convert probabilities tensor to a Python list
probabilities = probabilities.detach().numpy().tolist()[0]
# Map class labels to their probabilities
class_probabilities = {id2label[i]: prob for i,
prob in enumerate(probabilities)}
        # Prepare a figure with 2 subplots: the video frame on top and a
        # radar plot of the emotion probabilities below
        df_plot = pd.DataFrame({i: [j] for i, j in class_probabilities.items()})
# radar
categories = [i.upper() for i in df_plot.columns]
categories = [*categories, categories[0]]
figm = px.imshow(vid_img)
fig = make_subplots(
rows=2, cols=1, vertical_spacing=0.08,
specs=[[{"type": "image"}], [{"type": "scatterpolar"}]])
fig.add_trace(figm.data[0], 1, 1)
r_ = df_plot.iloc[0].values.tolist()
r_ = [*r_, r_[0]]
fig.add_trace(go.Scatterpolar(
r=r_,
theta=categories,
fill='toself',
#name=str(df_plot.index[cluster-1]),
line_color=clr,
opacity=0.7),2,1)
fig.update_layout(template=None, plot_bgcolor="#FFFFFF",
paper_bgcolor="#FFFFFF",#width=1000,
height=1000,
polar=dict(radialaxis=dict(angle=90,
tick0=1,
dtick=0.5,
range=[-0.4, 1],
tickangle=90,
titlefont={"size": 15, }),
angularaxis=dict(rotation=90,
tickfont={"size": 15})),
showlegend=False)
fig.update_layout(xaxis1_visible=False, yaxis1_visible=False)
fig.show(config={
"editable": True,
'toImageButtonOptions': {
'format': 'png', # one of png, svg, jpeg, webp
'filename': 'ml3_fp',
'scale': 5 # Multiply title/legend/axis/canvas sizes by this factor
}
})
frame = video_data[350]
# Convert the frame to a PIL image and display it
image = Image.fromarray(frame)
detect_emotions_single(image, video_data[350], '#B82E24')
def detect_emotions(image, vid_img):
    """
    Detect the dominant face in an image and classify its emotion.
    Returns a tuple of the original video frame and a dictionary of
    class probabilities, or (None, None) if no face is detected.
    """
    temporary = image.copy()
    # Detect faces in the image using the MTCNN model
    sample = mtcnn.detect(temporary)
if sample[0] is not None:
box = sample[0][0]
# Crop the face
face = temporary.crop(box)
# Pre-process the face
inputs = extractor(images=face, return_tensors="pt")
# Run the image through the model
outputs = model(**inputs)
# Apply softmax to the logits to get probabilities
probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
# Retrieve the id2label attribute from the configuration
config = AutoConfig.from_pretrained("trpakov/vit-face-expression")
id2label = config.id2label
# Convert probabilities tensor to a Python list
probabilities = probabilities.detach().numpy().tolist()[0]
# Map class labels to their probabilities
class_probabilities = {
id2label[i]: prob for i, prob in enumerate(probabilities)
}
return vid_img, class_probabilities
return None, None
def plotly_fig2array(fig):
    """Render a Plotly figure to PNG and return it as a numpy array."""
fig_bytes = fig.to_image(format="png", scale=1)
buf = io.BytesIO(fig_bytes)
img = Image.open(buf)
return np.asarray(img)
def create_combined_image(face, vid_img, class_probabilities, clr):
    """
    Create an image combining the video frame and a radar plot of the
    emotion probabilities.

    Parameters
    ----------
    face : PIL.Image
        The detected face.
    vid_img : np.ndarray
        The original video frame shown above the radar plot.
    class_probabilities : dict
        The probability of each emotion class.
    clr : str
        Hex color of the radar plot trace.

    Returns
    -------
    np.ndarray
        The combined figure rendered as a numpy array.
    """
    # Prepare a figure with 2 subplots: the video frame on top and a radar
    # plot of the emotion probabilities below
    df_plot = pd.DataFrame({i: [j] for i, j in class_probabilities.items()})
# radar
categories = [i.upper() for i in df_plot.columns]
categories = [*categories, categories[0]]
figm = px.imshow(vid_img)
fig = make_subplots(
rows=2, cols=1, vertical_spacing=0.08,
specs=[[{"type": "image"}], [{"type": "scatterpolar"}]])
fig.add_trace(figm.data[0], 1, 1)
r_ = df_plot.iloc[0].values.tolist()
r_ = [*r_, r_[0]]
fig.add_trace(go.Scatterpolar(
r=r_,
theta=categories,
fill='toself',
#name=str(df_plot.index[cluster-1]),
line_color=clr,
opacity=0.7),2,1)
fig.update_layout(template=None, plot_bgcolor="#FFFFFF",
paper_bgcolor="#FFFFFF",#width=1000,
height=1000,
polar=dict(radialaxis=dict(angle=90,
tick0=1,
dtick=0.5,
range=[-0.4, 1],
tickangle=90,
titlefont={"size": 15, }),
angularaxis=dict(rotation=90,
tickfont={"size": 15})),
showlegend=False)
fig.update_layout(xaxis1_visible=False, yaxis1_visible=False)
# Convert the figure to a numpy array
img = plotly_fig2array(fig)
return img
The final stage of our pipeline is explainability. To interpret the decision-making process of our ViT model, we use an Explanandum-Explainer framework: the Explainer model generates class-specific attribution masks, pinpointing the regions of the actor's face that the ViT attends to when classifying emotions. These explanations give us a deeper understanding of the model's behavior and insight into which facial features most strongly convey different emotions. The reference algorithm and architecture were developed by Stalder et al. and presented at the 2022 Conference on Neural Information Processing Systems [4]. We adapted their implementation, available on GitHub, to the specific needs of this research.
Figure 4. Explanandum-Explainer Architecture
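Conceptually, applying a class-specific attribution mask amounts to an element-wise product between the mask and the face image; the shapes and the highlighted region below are hypothetical, purely for illustration:

```python
import numpy as np

# Hypothetical shapes and region, purely illustrative: the Explainer emits a
# per-pixel mask in [0, 1] for one emotion class; multiplying it into the
# face crop keeps only the regions the classifier relied on.
face = np.random.rand(224, 224, 3)    # normalized face crop
mask = np.zeros((224, 224))           # class-specific attribution mask
mask[60:110, 40:180] = 1.0            # e.g. strong attribution near the eyes

highlighted = face * mask[..., None]  # broadcast the mask over RGB channels
print(highlighted.shape)              # (224, 224, 3)
```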
In this section, we present the findings and insights derived from our study on emotion recognition in select movie scenes and the attributions resulting from our explainable AI model. Through an in-depth analysis and interpretation of the collected data, we aim to unravel the impact of AI on the understanding and portrayal of emotions in Pinoy cinema, while also exploring the transparency and interpretability of our AI model. The following discussions provide a comprehensive analysis of the results obtained and the subsequent discussions that arise from them.
Our analysis concentrated on a diverse selection of movie scenes sourced from various Pinoy films. To accurately determine the emotions portrayed by the actors, we employed sophisticated facial recognition models, specifically MTCNN and ViT-Face-Expression. MTCNN aided in detecting faces, while ViT-Face-Expression facilitated the classification of the displayed emotions. The data we collected encompasses a broad range of emotional states, such as happiness, sadness, anger, fear, and disgust.
Now, let's delve into each of the emotions depicted in the chosen movie scenes individually.
# Load your video
four_sisters_video = 'videos/four_sisters.mp4'
clip = VideoFileClip(four_sisters_video)
vid_fps = clip.fps
print(f"Video fps: {vid_fps}")
# Get the video (as frames)
video = clip.without_audio()
video_data = np.array(list(video.iter_frames()))
Video fps: 23.976023976023978
skips = 1
reduced_video = []
for i in tqdm(range(0, len(video_data), skips)):
reduced_video.append(video_data[i])
0%| | 0/384 [00:00<?, ?it/s]
# Define a list of emotions
emotions = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]
# List to hold the combined images
combined_images = []
# Create a list to hold the class probabilities for all frames
all_class_probabilities = []
# Loop over video frames
for i, frame in tqdm(enumerate(reduced_video),
total=len(reduced_video),
desc="Processing frames"):
# Convert frame to uint8
frame = frame.astype(np.uint8)
# Call detect_emotions to get face and class probabilities
face, class_probabilities = detect_emotions(Image.fromarray(frame), frame)
# If a face was found
if face is not None:
# Create combined image for this frame
combined_image = create_combined_image(face, frame, class_probabilities, '#ffbd59')
# Append combined image to the list
combined_images.append(combined_image)
else:
# If no face was found, set class probabilities to None
class_probabilities = {emotion: None for emotion in emotions}
# Append class probabilities to the list
all_class_probabilities.append(class_probabilities)
Processing frames: 0%| | 0/384 [00:00<?, ?it/s]
# Convert list of images to video clip
clip_with_plot = ImageSequenceClip(combined_images,
                                   fps=vid_fps / skips)  # source frame rate, adjusted for skipped frames
# Display the clip
clip_with_plot.ipython_display(width=900)
Moviepy - Building video __temp__.mp4. Moviepy - Writing video __temp__.mp4
Moviepy - Done ! Moviepy - video ready __temp__.mp4
Figure 5. Radar Plot of Frame-by-Frame Emotions (Four Sisters and a Wedding)
Four Sisters and a Wedding is a captivating film that delves into the intricate emotional dynamics within the Salazar family. It masterfully portrays a wide array of emotions, including joy, love, jealousy, regret, and reconciliation, resulting in a deeply resonant cinematic experience. One particularly memorable and "memeable" scene in the movie involves the revelation that Teddie has not found success in her life abroad. This scene effectively conveys a spectrum of emotions.
As the secret is unveiled, Teddie experiences a combination of fear and sadness, anxiously anticipating her mother's reaction. Her emotions are palpable as she navigates the uncertainty of how her revelation will be received. Simultaneously, Bobbie, upon witnessing Teddie's confession of jealousy, displays a mixture of confusion, with facial expressions blending elements of happiness, anger, and sadness. This emotionally charged scene highlights the complexity and depth of the characters' feelings, leaving a lasting impact on the viewers.
clip_with_plot.write_videofile("videos/v_emotion_foursisters.mp4")
Moviepy - Building video v_emotion_foursisters.mp4. Moviepy - Writing video v_emotion_foursisters.mp4
Moviepy - Done ! Moviepy - video ready v_emotion_foursisters.mp4
# Load your video
tadhana_video = 'videos/tadhana.mp4'
clip = VideoFileClip(tadhana_video)
vid_fps = clip.fps
print(f"Video fps: {vid_fps}")
# Get the video (as frames)
video = clip.without_audio()
video_data = np.array(list(video.iter_frames()))
skips = 1
reduced_video = []
for i in tqdm(range(0, len(video_data), skips)):
reduced_video.append(video_data[i])
Video fps: 30.0
0%| | 0/471 [00:00<?, ?it/s]
# Define a list of emotions
emotions = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]
# List to hold the combined images
combined_images = []
# Create a list to hold the class probabilities for all frames
all_class_probabilities = []
# Loop over video frames
for i, frame in tqdm(enumerate(reduced_video),
total=len(reduced_video),
desc="Processing frames"):
# Convert frame to uint8
frame = frame.astype(np.uint8)
# Call detect_emotions to get face and class probabilities
face, class_probabilities = detect_emotions(Image.fromarray(frame), frame)
# If a face was found
if face is not None:
# Create combined image for this frame
combined_image = create_combined_image(face, frame, class_probabilities, '#B82E24')
# Append combined image to the list
combined_images.append(combined_image)
else:
# If no face was found, set class probabilities to None
class_probabilities = {emotion: None for emotion in emotions}
# Append class probabilities to the list
all_class_probabilities.append(class_probabilities)
Processing frames: 0%| | 0/471 [00:00<?, ?it/s]
# Convert list of images to video clip
clip_with_plot = ImageSequenceClip(combined_images,
                                   fps=vid_fps / skips)  # source frame rate, adjusted for skipped frames
# Display the clip
clip_with_plot.ipython_display(width=900)
Moviepy - Building video __temp__.mp4. Moviepy - Writing video __temp__.mp4
Moviepy - Done ! Moviepy - video ready __temp__.mp4
Figure 6. Radar Plot of Frame-by-Frame Emotions (That Thing Called Tadhana)
The film That Thing Called Tadhana has become synonymous with its ability to evoke deep emotions and resonate with viewers through its relatable "hugot feels." It follows the journey of two individuals who are grappling with heartbreak and embarking on a quest of self-discovery, exploring the intense emotions associated with love and breakups.
In another iconic scene, the protagonists, Mace and Anthony, engage in a conversation about the expectations placed upon them during this phase of their lives. Mace, with a mixture of happiness and sadness apparent on her face, acknowledges the contrast between the facade of greatness they are expected to embody and their true emotional state. She raises a toast, saying, "To the great people that we are today," aware of the bittersweet reality they face. On the other hand, Anthony playfully responds with a smile on his face, saying "Sinungaling!", reflecting a sense of happiness and lightheartedness in their banter.
This particular scene captures the nuanced emotions experienced by the characters, juxtaposing the societal expectations they confront with the underlying truth of their personal struggles. The interplay of happiness, sadness, and playfulness adds depth to their interaction and contributes to the film's poignant exploration of love and self-acceptance.
clip_with_plot.write_videofile("videos/v_emotion_tadhana.mp4")
Moviepy - Building video v_emotion_tadhana.mp4. Moviepy - Writing video v_emotion_tadhana.mp4
Moviepy - Done ! Moviepy - video ready v_emotion_tadhana.mp4
# Load your video
fengshui_video = 'videos/feng_shui.mp4'
clip = VideoFileClip(fengshui_video)
vid_fps = clip.fps
print(f"Video fps: {vid_fps}")
# Get the video (as frames)
video = clip.without_audio()
video_data = np.array(list(video.iter_frames()))
skips = 1
reduced_video = []
for i in tqdm(range(0, len(video_data), skips)):
reduced_video.append(video_data[i])
Video fps: 29.97002997002997
0%| | 0/518 [00:00<?, ?it/s]
# Define a list of emotions
emotions = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]
# List to hold the combined images
combined_images = []
# Create a list to hold the class probabilities for all frames
all_class_probabilities = []
# Loop over video frames
for i, frame in tqdm(enumerate(reduced_video),
total=len(reduced_video),
desc="Processing frames"):
# Convert frame to uint8
frame = frame.astype(np.uint8)
# Call detect_emotions to get face and class probabilities
face, class_probabilities = detect_emotions(Image.fromarray(frame), frame)
# If a face was found
if face is not None:
# Create combined image for this frame
combined_image = create_combined_image(face, frame, class_probabilities, '#312b2c')
# Append combined image to the list
combined_images.append(combined_image)
else:
# If no face was found, set class probabilities to None
class_probabilities = {emotion: None for emotion in emotions}
# Append class probabilities to the list
all_class_probabilities.append(class_probabilities)
Processing frames: 0%| | 0/518 [00:00<?, ?it/s]
# Convert list of images to video clip
clip_with_plot = ImageSequenceClip(combined_images,
                                   fps=vid_fps / skips)  # source frame rate, adjusted for skipped frames
# Display the clip
clip_with_plot.ipython_display(width=900)
Moviepy - Building video __temp__.mp4. Moviepy - Writing video __temp__.mp4
Moviepy - Done ! Moviepy - video ready __temp__.mp4
Figure 7. Radar Plot of Frame-by-Frame Emotions (Feng Shui)
The 2004 movie Feng Shui stands as a remarkable cinematic achievement, captivating audiences with its compelling narrative and its profound ability to evoke deep emotions. Through its gripping storytelling, the film delves into the intricate realms of fate, superstition, and the enigmatic forces that intricately interweave with ordinary existence.
Within one particular scene, we witness Joy's face filled with a potent mixture of fear and sadness as she comes to a startling realization: she is confronted with apparitions of Inton, Denton, Ingrid, and Thelma. The weight of this eerie encounter is palpable as Joy's emotions are laid bare, reflecting her genuine terror and the overwhelming sadness that accompanies the apparitions' presence.
This pivotal scene epitomizes the film's ability to evoke intense emotions within its viewers. Joy's expression of fear and sadness serves as a conduit for the audience's own apprehension, drawing them further into the narrative's suspenseful depths. The scene encapsulates the haunting nature of the movie, leaving a lasting impact on the audience as they navigate the intricate web of destiny and the supernatural elements that define the story.
clip_with_plot.write_videofile("videos/v_emotion_fengshui.mp4")
Moviepy - Building video v_emotion_fengshui.mp4. Moviepy - Writing video v_emotion_fengshui.mp4
Moviepy - Done ! Moviepy - video ready v_emotion_fengshui.mp4
# Load your video
himala_video = 'videos/himala_.mp4'
clip = VideoFileClip(himala_video)
vid_fps = clip.fps
print(f"Video fps: {vid_fps}")
# Get the video (as frames)
video = clip.without_audio()
video_data = np.array(list(video.iter_frames()))
skips = 1
reduced_video = []
for i in tqdm(range(0, len(video_data), skips)):
reduced_video.append(video_data[i])
Video fps: 30.0
0%| | 0/355 [00:00<?, ?it/s]
# Define a list of emotions
emotions = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]
# List to hold the combined images
combined_images = []
# Create a list to hold the class probabilities for all frames
all_class_probabilities = []
# Loop over video frames
for i, frame in tqdm(enumerate(reduced_video),
total=len(reduced_video),
desc="Processing frames"):
# Convert frame to uint8
frame = frame.astype(np.uint8)
# Call detect_emotions to get face and class probabilities
face, class_probabilities = detect_emotions(Image.fromarray(frame), frame)
# If a face was found
if face is not None:
# Create combined image for this frame
combined_image = create_combined_image(face, frame, class_probabilities, '#5271ff')
# Append combined image to the list
combined_images.append(combined_image)
else:
# If no face was found, set class probabilities to None
class_probabilities = {emotion: None for emotion in emotions}
# Append class probabilities to the list
all_class_probabilities.append(class_probabilities)
Processing frames: 0%| | 0/355 [00:00<?, ?it/s]
# Convert list of images to video clip
clip_with_plot = ImageSequenceClip(combined_images,
                                   fps=vid_fps / skips)  # source frame rate, adjusted for skipped frames
# Display the clip
clip_with_plot.ipython_display(width=900)
Moviepy - Building video __temp__.mp4. Moviepy - Writing video __temp__.mp4
Moviepy - Done ! Moviepy - video ready __temp__.mp4
Figure 8. Radar Plot of Frame-by-Frame Emotions (Himala)
Himala holds an esteemed position in the realm of Filipino cinema, recognized as one of its greatest contributions. The film delves profoundly into the intricate web of human emotions, specifically exploring the profound connection between faith and desperation.
One particular scene from Himala has etched itself into the collective memory of viewers. As Elsa utters the iconic line, "Walang himala!", her face becomes a canvas displaying a range of powerful emotions. Sadness, anger, and disgust intertwine, etching their mark upon her countenance.
Elsa's visage mirrors the depth of her emotions, encapsulating the weight of her struggles and disillusionment. The scene resonates with the audience, who are drawn into the vortex of conflicting feelings that Elsa experiences. It serves as a poignant reflection of the complexities of faith, the fragility of human spirit, and the disheartening realities that can test one's beliefs.
This powerful scene exemplifies the film's ability to evoke profound emotions and provoke introspection. Through Elsa's expressions of sadness, anger, and disgust, Himala immerses viewers in the turbulent emotional journey of its characters and invites contemplation on the nature of faith and the harsh realities of life.
clip_with_plot.write_videofile("videos/v_emotion_himala.mp4")
Moviepy - Building video v_emotion_himala.mp4. Moviepy - Writing video v_emotion_himala.mp4
Moviepy - Done ! Moviepy - video ready v_emotion_himala.mp4
The transparency and interpretability of our AI model were pivotal in thoroughly analyzing the factors influencing its decision-making process. By utilizing explainable AI techniques, we aimed to identify the significant facial features and cues that drove the model's predictions. Through a comprehensive attribution analysis, we gained valuable insights into the relative significance of various facial regions and expressions in determining the recognized emotions.
Now, let's delve into the attributed facial features for the diverse movie scenes we have previously explored.
data_module = CustomDataModule(
'datasets/data/classifier/'
)
num_classes = 7
model = ExplainerClassifierModel(
num_classes=num_classes,
use_mask_variation_loss=False,
use_mask_area_loss=True
)
/opt/conda/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
/opt/conda/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=None`.
2023-06-14 17:48:08.525379: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
/home/msds2023/pestrada/.local/lib/python3.10/site-packages/transformers/models/vit/feature_extraction_vit.py:28: FutureWarning: The class ViTFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use ViTImageProcessor instead.
import datetime

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

log_dir = "tb_logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
logger = pl.loggers.TensorBoardLogger(log_dir, name="NN Explainer")
early_stop_callback = EarlyStopping(
monitor="val_loss",
min_delta=0.001,
patience=5,
verbose=False,
mode="min",
)
trainer = pl.Trainer(
logger=logger,
callbacks=[early_stop_callback],
max_epochs=300
)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
train_model = True
data_module.setup('test')
if train_model:
torch.autograd.set_detect_anomaly(True)
trainer.fit(model=model, datamodule=data_module)
trainer.test(dataloaders=data_module.test_dataloader())
else:
trainer.test(model=model, datamodule=data_module)
Means: [0.6235189 0.4881056 0.4277293] Std. Deviations: [0.25944903 0.22985741 0.22180712]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name                   | Type                            | Params
---------------------------------------------------------------------
0 | explainer              | Deeplabv3Resnet50ExplainerModel | 39.6 M
1 | classifier             | ViTForImageClassification       | 85.8 M
2 | total_variation_conv   | TotalVariationConv              | 18
3 | classification_loss_fn | CrossEntropyLoss                | 0
4 | train_metrics          | SingleLabelMetrics              | 0
5 | valid_metrics          | SingleLabelMetrics              | 0
6 | test_metrics           | SingleLabelMetrics              | 0
---------------------------------------------------------------------
39.6 M    Trainable params
85.8 M    Non-trainable params
125 M     Total params
501.757   Total estimated model params size (MB)
Restoring states from the checkpoint path at tb_logs/fit/20230614-174812/NN Explainer/version_0/checkpoints/epoch=85-step=1118.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0] Loaded model weights from the checkpoint at tb_logs/fit/20230614-174812/NN Explainer/version_0/checkpoints/epoch=85-step=1118.ckpt
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃        Test metric        ┃       DataLoader 0        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│       test_Accuracy       │    0.9606626033782959     │
│       test_F-Score        │    0.9354895353317261     │
│      test_Precision       │    0.9629977345466614     │
│       test_Recall         │    0.9128401875495911     │
│        test_loss          │    0.6426100134849548     │
└───────────────────────────┴───────────────────────────┘
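As a consistency check on the table above, the harmonic mean of the reported precision and recall gives a pooled F-score of about 0.937; the slightly lower reported test_F-Score (0.935) is consistent with per-class (macro) averaging rather than pooling. A quick sketch:

```python
precision = 0.9629977345466614
recall = 0.9128401875495911

# Harmonic mean of precision and recall (pooled / micro F1).
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.3f}")  # approximately 0.937
```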
import glob
from PIL import Image
emo = {
0: "angry",
1: "disgust",
2: "fear",
3: "happy",
4: "neutral",
5: "sad",
6: "surprise"
}
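The loop below keeps every emotion whose softmax probability exceeds 1/7, i.e. a uniform prior over the seven classes. A minimal standalone sketch of that selection step (toy logits, not actual model output):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.1, 0.1, 1.5, 0.5, 0.1, 0.1]])  # toy logits
probs = F.softmax(logits, dim=1)

# Indices of classes scoring above the uniform prior of 1/7.
targets = ((probs > (1 / 7)) * 1).nonzero()[:, -1].flatten()
print(targets.tolist())  # [0, 3]
```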
from torchvision import transforms
import torch.nn.functional as F

test_transforms = transforms.Compose([
    transforms.Resize(size=(224, 224)),
    transforms.ToTensor()])

for img_path in glob.glob('datasets/patticakes/*'):
    image = Image.open(img_path).convert('RGB')
    image_tensor = test_transforms(image)

    # Classify the image and keep every emotion whose probability
    # exceeds a uniform prior over the 7 classes.
    labels = model.classifier(image_tensor.unsqueeze(0)).logits
    labels = F.softmax(labels, dim=1)
    targets = ((labels > (1 / 7)) * 1).nonzero()[:, -1].flatten().detach().numpy()

    # Per-class attribution masks (assumes the trained explainer submodule
    # is callable on a batch and returns one mask channel per class).
    mask = model.explainer(image_tensor.unsqueeze(0))

    fig, ax = plt.subplots(2, 3, figsize=(6.4 * 2, 4.8 * 1.5))
    fig.suptitle(f'{img_path.split("/")[-1]}')
    ax = ax.flatten()
    for i, (target, label) in enumerate(zip(targets, labels.flatten()[targets])):
        # Overlay the class mask on the original frame.
        ax[i].imshow(image_tensor.squeeze().permute(1, 2, 0))
        ax[i].imshow(mask.squeeze()[target].sigmoid().detach(), cmap='coolwarm',
                     alpha=0.4)
        ax[i].set_title(f'{emo[target]} = {label * 100:.2f}%')
        ax[i].set_axis_off()
    # Remove unused subplot axes.
    for i in range(1, len(ax) - len(targets) + 1):
        fig.delaxes(ax[-i])
    plt.show()
Figure 9. Detected Emotion Probabilities with Region Attributions
Several notable attributions emerge from these results.
These emotions serve as powerful tools to engage the audience, highlighting the characters' internal struggles, conflicts, and journeys. Through the nuanced portrayal of these emotions, the films evoke a range of responses from viewers, immersing them in the emotional landscapes and thematic explorations of love, family, faith, and self-discovery. The combination of these emotions with the storytelling elements creates a captivating cinematic experience, leaving a lasting impact on the audience long after the credits roll.
Our study demonstrates the significant potential of AI and explainability techniques in the development of budding actors in the Philippine cinema industry. The results highlight the importance of emotional complexity in scenes, which were found to be more engaging. By dissecting memorable scenes from renowned Filipino films, we offer objective and measurable insights to enhance acting craft.
The use of pretrained models streamlined our process and enhanced accuracy, demonstrating the effectiveness of such models in addressing the challenges of data scarcity, limited processing power, and time constraints. These models can be fine-tuned to specific tasks, offering a practical solution for efficient and large-scale emotion analysis in films.
For novice actors pursuing self-study, our research provides a valuable tool to help navigate through vast resources and extract key insights. In addition to providing a supplementary resource to traditional coaching, our project democratizes access to high-quality training resources and bridges the disparity gap in the industry.
Overall, by leveraging AI and explainability methods, we pave the way for a more standardized, objective, and accessible training process in the Philippine cinema industry, contributing to its overall growth and quality enhancement.
This study is just a stepping stone towards the utilization of AI in improving acting performances, and it offers several avenues for future research.
[1] Chobanyan, M. (2019). Training an emotion detector with transfer learning. Towards Data Science. URL.
[2] Zhang, K., Zhang, Z., Li, Z., & Qiao, Y. (2016). Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10), 1499-1503. URL.
[3] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations (ICLR). URL.
[4] Stalder, S., Perraudin, N., Achanta, R., Perez-Cruz, F., & Volpi, M. (2022). What you see is what you classify: Black box attributions. Conference on Neural Information Processing Systems (NeurIPS). URL.